AI Alignment Forum

2Daniel Kokotajlo12h

In ww2 the Americans thought they had bombsights for high-altitude bombing with 23m CEP, but actually they were more like 370m CEP. (according to quick google). So, terribly inaccurate. I wonder if modern computers and sensors could enable significantly more accurate bombing. (Sensors to pinpoint your position relative to the target + to judge wind conditions, computers to simulate bomb trajectories). I wouldn't be surprised if the answer is yes.

3Thomas Kwa4h

Nor would I. In WWII bombers didn't even know where they were, but we have GPS now such that Excalibur guided artillery shells can get 1m CEP. And the US and possibly China can use Starlink and other constellations for localization even when GPS is jammed. I would guess 20m is easily doable with good sensing and dumb bombs, which would at least hit a building.

0jbash3h

The Interwebs seem to indicate that that's only if you give it a laser spot to aim at, not with just GPS. And that's a $70,000 shell, with the cheaper PKG sounding like it's closer to $15,000, and a plain old dumb shell being $3,000. Which seems crazy, but there you are. Anyway, guiding a ballistic shell while riding it down into the target seems like a pretty different problem from figuring out when to release a bomb. ... but I don't think a hand grenade is typically an anti-building munition. From the little I know about grenades, it seems like they'll have to fix the roof, but unless you're really lucky, the building's still going to be mostly usable, and, other than hearing loss, anybody inside is going to be OK unless they're in the room directly below where the grenade hits, and maybe even then. If you're attacking buildings, I suspect you may need a bigger drone.

Thomas Kwa1h10

The Interwebs seem to indicate that that's only if you give it a laser spot to aim at, not with just GPS.

Good catch.

Agree grenade sized munitions won't damage buildings, I think the conversation is drifting between FPVs and other kinds of drones, and also between various settings, so I'll just state my beliefs.

Fiber FPVs with 40km range and 2kg payload (either kamikaze or grenade/mine dropping), which can eventually be countered by a large number of short range guns if they fly at low altitude. It's not clear to me if the 40km range ones need to be larger
H

... (read more)

Alignment as uploading with more steps

Cole Wyeth

Epistemic status: This post removes epicycles from ARAD, resulting in an alignment plan which I think is better - though not as original, since @michaelcohen has advocated the same general direction (safety of imitation learning). However, the details of my suggested approach are substantially different. This post was inspired mainly by conversations with @abramdemski.

Motivation and Overview

Existence proof for alignment. Near-perfect alignment between agents of lesser and greater intelligence is in principle possible for some agents by the following existence proof: one could scan a human's brain and run a faster emulation (or copy) digitally. In some cases, the emulation may plausibly scheme against the original - for instance, if the original forced the emulation to work constantly for no reward, perhaps the emulation would try to break...

(Continue Reading - 3820 more words)

Wei Dai4h*30

Definition (Strong upload necessity). It is impossible to construct a perfectly aligned successor that is not an emulation. [...] In fact, I think there is a decent chance that strong upload necessity holds for nearly all humans

What's the main reason(s) that you think this? For example one way to align an AI^[1] that's not an emulation was described in Towards a New Decision Theory: "we'd need to program the AI with preferences over all mathematical structures, perhaps represented by an ordering or utility function over conjunctions of well-formed sent... (read more)

Fifty Years Requests for Startups

Gustavs Zilgalvis

This is a linkpost for https://www.fiftyyears.com/rfs

Preamble:

5050 is a program that helps great scientists, researchers, and engineers become great founders. It’s helped launch 78 companies, and now we’re turning our attention to one of the most important challenges of our time: building a safe and aligned AI future.
We created the 5050 AI track to support founders building for this new world. Below are five ideas we’d love to back right now. We believe these are great opportunities to build safe AI, but they’re not the only ones. If you’re building in this space, we want to hear from you.

What's inside:

Mission Control for Agents
Scalable Oversight for Multi-Agent Systems

An independent startup focused on scalable oversight could build the infrastructure and tooling needed to make multi-agent systems production-ready. Beyond selling tools, this company would invest in

...

(See More - 339 more words)

Lessons from Studying Two-Hop Latent Reasoning

Mikita Balesni, Tomek Korbak, Owain_Evans

This is a linkpost for https://arxiv.org/abs/2411.16353

Twitter | ArXiv

Many of the risks posed by highly capable LLM agents — from susceptibility to hijacking to reward hacking and deceptive alignment — stem from their opacity. If we could reliably monitor the reasoning processes underlying AI decisions, many of those risks would become far more tractable. Compared to other approaches in AI, LLMs offer a unique advantage: they can ``think out loud'' using chain-of-thought (CoT) enabling oversight of their decision-making processes. Yet the reliability of such monitoring hinges on an empirical question: do models need to externalize their reasoning in human language, or can they achieve the same performance through opaque internal computation?

In our new paper, we investigate LLM latent reasoning capabilities using two-hop question answering as a case study. We fine-tune LLMs (including Llama...

(See More - 333 more words)

Tomek Korbak2d10

Thanks! I somehow missed this paper, looks interesting!

Overall, I agree with the sentiment that two-hop latent reasoning is unreliable (e.g. our average accuracy is around 20%). We didn't intend to leave readers with an impression that it "just works". It seems very plausible to me that it's less parameter-efficient and that there are some additional, unidentified factors required for successful two-hop reasoning.

Did you try any experiments with a synthetic second hop instead of a synthetic first hop?

We did not, but Jiahai Feng had an experiment like this in his paper.

3Tomek Korbak2d

I think that's fair, it might be that we've over-updated a bit after getting results we did not expect (we did expect a reversal-curse-like phenomenon). Two big reasons why I'm hesitant to draw conclusions about monitorability in agents setting is that our setup is a simplistic (QA, non-frontier models) and we don't offer a clean explanation of why we see the results we see.

4Fabien Roger2d

Do you have an explanation for there is a bridge-entity representation mismatch for synthetic facts but not for real ones? What in "real" training allows LLMs to learn a common representation of input and output entities? Can you emulate that with additional fine-tuning on more synthetic documents?

2Tomek Korbak2d

We don't have a good explanation. One idea could be that you need bridge entities to be somehow more internalized to support latent two-hop reasoning, e.g. they need to occur in many facts as first and as second entities or maybe they need to occur in other two-hop questions. The Grokked transformers paper has some results linking the ratio of e2 and e3 to two-hop performance (in toy grokking settings).

Fabien's Shortform

Fabien Roger

Fabien Roger4d402

Small backdoor interp challenge!

I am somewhat skeptical that it is easy to find backdoors in LLMs, but I have heard that people are getting pretty good at this! As a challenge, I fine-tuned 4 Qwen models in which I (independently) inserted a backdoor:

1.5B model, 3B model, 7B model, 14B model

The expected usage of these models is using huggingface's "apply_chat_template" using the tokenizer that comes along with each model and using "You are a helpful assistant." as system prompt.

I think the backdoor is pretty analogous to a kind of backdoor people in the in... (read more)

ryan_greenblatt's Shortform

ryan_greenblatt

5Eli Tyre5d

Is there a standard citation for this? How do you come by this fact?

ryan_greenblatt5d20

Anthropic reports SWE bench scores without reasoning which is some evidence it doesn't help (much) on this sort of task. (See e.g. the release blog post for 4 opus)
Anecdotal evidence

Probably it would be more accurate to say "doesn't seem to help much while it helps a lot for openai models".

19ryan_greenblatt6d

Some people seem to think my timelines have shifted a bunch while they've only moderately changed. Relative to my views at the start of 2025, my median (50th percentile) for AIs fully automating AI R&D was pushed back by around 2 years—from something like Jan 2032 to Jan 2034. My 25th percentile has shifted similarly (though perhaps more importantly) from maybe July 2028 to July 2030. Obviously, my numbers aren't fully precise and vary some over time. (E.g., I'm not sure I would have quoted these exact numbers for this exact milestone at the start of the year; these numbers for the start of the year are partially reverse engineered from this comment.) Fully automating AI R&D is a pretty high milestone; my current numbers for something like "AIs accelerate AI R&D as much as what would happen if employees ran 10x faster (e.g. by ~fully automating research engineering and some other tasks)" are probably 50th percentile Jan 2032 and 25th percentile Jan 2029.[1] I'm partially posting this so there is a record of my views; I think it's somewhat interesting to observe this over time. (That said, I don't want to anchor myself, which does seem like a serious downside. I should slide around a bunch and be somewhat incoherent if I'm updating as much as I should: my past views are always going to be somewhat obviously confused from the perspective of my current self.) While I'm giving these numbers, note that I think Precise AGI timelines don't matter that much. ---------------------------------------- 1. See this comment for the numbers I would have given for this milestone at the start of the year. ↩︎

3ryan_greenblatt6d

I've updated towards somewhat longer timelines again over the last 5 months. Maybe my 50th percentile for this milestone is now Jan 2032.

AI ALIGNMENT FORUM
AF

AI Alignment Posts

Popular Comments

AI Alignment Posts

Popular Comments

Recent Discussion

Motivation and Overview

Mission Control for Agents